170 ◾ Bioinformatics
For indexing the human genome, STAR may require more than 32GB of RAM and it may
take a long time depending on the computer RAM and the number of processors used.
Once indexing has been created, you can align the reads in FASTQ files to the reference
genome sequence. Since the practice sequence data is paired end, there are two FASTQ files
(r1 and r2) for each sample. STAR requires these two files as inputs in addition to the refer-
ence genome index we created above. Instead of running STAR for each sample, we can use
bash script as follows to align the reads of the sample files in “fastq” directory:
mkdir starout
mkdir bam
cd fastq
for i in $(ls *.gz | rev | cut -c 13- | rev | uniq);
do
STAR --runThreadN 4 \
--runMode alignReads \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand zcat \
--genomeDir ../indexes \
--outFileNamePrefix ../starout/${i} \
--readFilesIn ${i}_r1.fastq.gz ${i}_r2.fastq.gz
done
for f in $(ls *.bam | rev | cut -c 30-|rev);
do
mv ${f}Aligned.sortedByCoord.out.bam ../bam/${f}.bam
done
cd ..
In the above, first we created a directory “starout” for STAR output and “bam” for the BAM
files. The Linux bash command “$(ls *.gz | rev | cut -c 13- | rev | uniq)” lists the names of
the FASTQ files found in the directory and cut the common name (ID) for the files of each
sample. The “for loop” will provide that common name as a variable “${i}” to assemble the
FASTQ file names and the output file name prefix to the STAR command each time until
all FASTQ files are processed. There will be output files including alignment log files and a
BAM file for each sample (6 BAM files for all example data). The second “for loop” renamed
the long BAM file names and moved the files to the “bam” subdirectory.
The bash scripting comes in handy whenever there is a task that needs to be repeated
multiple times.
The STAR output log files provide important statistics about the read alignment. You
can refer to STAR manual to read about the output log files.
Before moving to the next step, you need to index the BAM files using “samtools index”
command as follows:
#index bam
cd bam
for i in $(ls *.bam);